10. PPO Part 1: The Surrogate Function
Re-weighting the Policy Gradient
Suppose we are trying to update our current policy, \pi_{\theta'}. To do that, we need to estimate a gradient, g. But we only have trajectories generated by an older policy \pi_{\theta}. How do we compute the gradient then?
Mathematically, we can use importance sampling. The answer is just what a normal policy gradient would be, times a re-weighting factor P(\tau;\theta')/P(\tau;\theta):
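Written out, this re-weighted gradient estimate from a trajectory \tau looks roughly like this (R_t^{future}, the reward collected from time-step t onward, is notation introduced here for concreteness):

g = \frac{P(\tau;\theta')}{P(\tau;\theta)} \sum_t \frac{\nabla_{\theta'} \pi_{\theta'}(a_t|s_t)}{\pi_{\theta'}(a_t|s_t)} R_t^{future}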
We can rearrange these equations: the re-weighting factor is just the product of the policies across all the steps -- I’ve picked out the terms at time-step t here. We can cancel some terms, but we're still left with a product of the policies at the other time-steps, denoted by "…".
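Concretely, the expansion referred to above should look roughly like this:

\frac{P(\tau;\theta')}{P(\tau;\theta)} = \frac{\pi_{\theta'}(a_1|s_1)\,\pi_{\theta'}(a_2|s_2)\cdots\pi_{\theta'}(a_t|s_t)\cdots}{\pi_{\theta}(a_1|s_1)\,\pi_{\theta}(a_2|s_2)\cdots\pi_{\theta}(a_t|s_t)\cdots}

The \pi_{\theta'}(a_t|s_t) in the numerator cancels against the \pi_{\theta'}(a_t|s_t) in the denominator of the gradient term at time-step t, leaving \pi_{\theta}(a_t|s_t) underneath; the ratios at all the other time-steps make up the "…".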
Can we simplify this expression further? This is where the "proximal" part of proximal policy optimization comes in: if the old and current policies are close enough to each other, all the factors inside the "…" will be pretty close to 1, and we can ignore them.
Then the equation simplifies.
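It should reduce to roughly this form (again using R_t^{future} for the reward from time-step t onward):

g \approx \sum_t \frac{\nabla_{\theta'} \pi_{\theta'}(a_t|s_t)}{\pi_{\theta}(a_t|s_t)} R_t^{future}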
It looks very similar to the old policy gradient. In fact, if the current policy and the old policy are the same, we recover exactly the vanilla policy gradient. But remember, this expression is different, because we are comparing two different policies.
The Surrogate Function
Now that we have the approximate form of the gradient, we can think of it as the gradient of a new object, called the surrogate function:
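One form of the surrogate consistent with the approximate gradient above (writing it here as L_{sur}(\theta',\theta), and keeping the R_t^{future} notation) is

L_{sur}(\theta', \theta) = \sum_t \frac{\pi_{\theta'}(a_t|s_t)}{\pi_{\theta}(a_t|s_t)} R_t^{future}

whose gradient with respect to \theta' is exactly the approximate g.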
So, using this new gradient, we can perform gradient ascent to update our policy -- which can be thought of as directly maximizing the surrogate function.
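For concreteness, a minimal PyTorch-style sketch of this step could look like the following (the names surrogate, new_log_probs, old_log_probs, and future_rewards are hypothetical; the log-probabilities and rewards are assumed to be tensors computed from the stored old trajectories):

import torch

def surrogate(new_log_probs, old_log_probs, future_rewards):
    # ratio pi_theta'(a_t|s_t) / pi_theta(a_t|s_t), computed from log-probabilities
    ratio = torch.exp(new_log_probs - old_log_probs.detach())
    # re-weighted sum of future rewards; its gradient w.r.t. the new policy's
    # parameters matches the approximate gradient g above
    return (ratio * future_rewards).sum()

# Gradient ascent on the surrogate is gradient descent on its negative:
#   loss = -surrogate(new_log_probs, old_log_probs, future_rewards)
#   loss.backward()
#   optimizer.step()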
But there is still one important issue we haven’t addressed yet. If we keep reusing old trajectories and updating our policy, at some point the new policy might become different enough from the old one that all the approximations we made could become invalid.
We need to find a way to make sure this doesn’t happen. Let’s see how in part 2.